{ggstatsplot}: Informative Statistical Visualizations

Indrajeet Patil

(18 Nov. 2024)

Genesis

Why a new package?

Life in the trenches (c. 2017)

(as an academic researcher)

Large-scale problems

  • Replication crisis:

“39% of effects were subjectively rated to have replicated the original result” [1]

  • Reporting errors:

“half of all published psychology papers contained at least one p-value that was inconsistent” [2]

  • Interpretation errors:

“in 72% of cases, nonsignificant results were misinterpreted [to mean] that effect was absent” [3]


Personal challenges

  • How to increase reproducibility?
  • How to avoid reporting errors?
  • How to make visualizing data effortless?
  • How to make statistical reporting more informative?
  • How to emphasize the importance of the effect?
  • How to interpret null results?
  • How to easily assess validity of model assumptions?

Proposal

Information-rich, ready-made statistical visualizations

Defaulting to a rich graphic

💡 Graphical summaries reveal problems not discernible from numerical statistics!
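A classic illustration of this point is Anscombe's quartet, which ships with base R as `anscombe`. The sketch below is an illustration chosen here, not an example from the talk: it computes the means and correlations of four datasets, which come out nearly identical, and then draws the scatterplots, which look nothing alike.

```r
# Anscombe's quartet: four x/y pairs stored as columns x1..x4 and y1..y4
x_cols <- paste0("x", 1:4)
y_cols <- paste0("y", 1:4)

# Numerical summaries for each of the four datasets
summaries <- data.frame(
  set    = 1:4,
  mean_x = vapply(x_cols, function(col) mean(anscombe[[col]]), numeric(1)),
  mean_y = vapply(y_cols, function(col) mean(anscombe[[col]]), numeric(1)),
  r      = mapply(function(x, y) cor(anscombe[[x]], anscombe[[y]]), x_cols, y_cols),
  row.names = NULL
)
print(round(summaries, 2))

# The same four datasets as scatterplots, in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(
    anscombe[[x_cols[i]]], anscombe[[y_cols[i]]],
    xlab = x_cols[i], ylab = y_cols[i], main = paste("Dataset", i)
  )
}
par(old_par)
```

The summary table alone would suggest that the four datasets are interchangeable; only the plots show the curvature and the outliers.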

Avoiding cooking a new plot


The grammar of graphics is a powerful framework and can help you make any data visualization!


💡 Using ready-made plots lowers the activation energy for visualizing data!

Action Plan

{ggstatsplot} was born!

(started out as loose scripts, which were later consolidated into a package)

Example function

E.g., for a hypothesis about group differences (ggbetweenstats())

ggbetweenstats(
  data  = iris,
  x     = Species,
  y     = Sepal.Length,
  title = "Distribution of sepal length across Iris species"
)

Important

Information-rich defaults

  • raw data + distributions
  • descriptive statistics
  • inferential statistics
  • effect size + uncertainty
  • pairwise comparisons
  • Bayesian hypothesis-testing
  • Bayesian estimation

Statistical approaches available

  • parametric
  • non-parametric
  • robust
  • Bayesian

And there is more!

Promised Land

Does it deliver?

Show, don’t tell

Without {ggstatsplot}

Pearson’s correlation test revealed that, across 142 participants, variable x was negatively correlated with variable y: \(t(140)=-0.76, p=0.446\). The effect size \((r=-0.06, 95\% CI [-0.23,0.10])\) was small, as per Cohen’s (1988) conventions. The Bayes Factor for the same analysis revealed that the data were 5.81 times more probable under the null hypothesis as compared to the alternative hypothesis. This can be considered moderate evidence (Jeffreys, 1961) in favor of the null hypothesis (absence of any correlation between x and y).
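Writing such a paragraph by hand means running the test, pulling each number out of the result object, and formatting it consistently. Below is a minimal base-R sketch of the frequentist half of that workflow. `mtcars` is only a stand-in dataset, and `report` is a hypothetical name used for illustration. The Bayes factor would require an additional package and another round of manual extraction.

```r
# Pearson's correlation test (mtcars is a stand-in dataset for illustration)
res <- cor.test(mtcars$wt, mtcars$mpg, method = "pearson")

# Pull every reported number out of the "htest" object by hand
report <- sprintf(
  "t(%d) = %.2f, p = %.3g, r = %.2f, 95%% CI [%.2f, %.2f]",
  as.integer(res$parameter),  # degrees of freedom
  res$statistic,              # t statistic
  res$p.value,                # p-value
  res$estimate,               # Pearson's r
  res$conf.int[1],            # lower bound of the confidence interval
  res$conf.int[2]             # upper bound of the confidence interval
)
print(report)
```

Every copied number and every rounding decision in this step is a chance to introduce a reporting error.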

With {ggstatsplot}

✅ No need to worry about reporting or interpretation errors!

Simplified data analysis workflow


✅ Quick insight into data by combining visualization and modeling!

Thoughtful Defaults

Data Visualization

Statistical Reporting

{statsExpressions} (Patil, 2021): statistics engine for {ggstatsplot}

(Doorn et al., 2020; APA Manual)
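As a rough sketch of what the engine does on its own, the {statsExpressions} functions return a data frame in which the test details sit alongside a ready-to-plot expression. The example below assumes the package is installed and uses `mtcars` purely as stand-in data.

```r
library(statsExpressions)

# Run a parametric (Pearson) correlation test; the result is a data frame
res <- corr_test(
  data = mtcars,
  x    = wt,
  y    = mpg,
  type = "parametric"
)

# Test details are available as regular columns...
print(names(res))
print(res[, c("statistic", "p.value", "estimate", "conf.low", "conf.high")])

# ...and the pre-formatted expression is stored in a list column
print(res$expression[[1]])
```

Because the output is a data frame, the same numbers that appear in a plot subtitle can also be reused in tables or in the text of a report.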

✅ Encourages best practices in data visualization and statistical reporting!

A grain of salt

The “Golem of Prague” problem



Promotes mindless application of statistical tests.

Clunky API

  • Not a “real” {ggplot2} extension.
  • Not interactive.
  • Statistical proficiency needed.
  • No stable (1.0) release yet.

Pleasant Side Effects

Maybe the real treasure was the technical skills we picked up along the way!

Software Architecture

Breaking down the monolith: \(20K \rightarrow 1K\) lines of code

flowchart TD
    ggstatsplot[ggstatsplot]
    statsExpressions[statsExpressions]
    note["backend engine"]
    
    subgraph easystats[easystats]
        effectsize[effectsize]
        insight[insight]
        parameters[parameters]
        performance[performance]
        bayestestR[bayestestR]
    end
    
    %% Main dependencies
    ggstatsplot --> statsExpressions
    ggstatsplot --> dots[Other dependencies]
    
    %% Add note connecting to the main relationship
    note -.-> statsExpressions
    
    %% statsExpressions dependencies on easystats packages
    statsExpressions --> effectsize
    statsExpressions --> insight
    statsExpressions --> parameters
    statsExpressions --> performance
    statsExpressions --> bayestestR
    
    %% Styling using custom colors
    classDef main fill:#fff88,stroke:#333,stroke-width:2px
    classDef note fill:#ffffff,stroke:#333,stroke-width:1px,stroke-dasharray: 5 5
    
    class ggstatsplot main
    class note note

Team sport

While re-architecting {ggstatsplot}, I decided to contribute to upstream open-source dependencies:

  • joined the core {easystats} team and began contributing to its ten component packages
  • became a co-maintainer of {ggsignif}
  • contributed to {WRS2} and {ggcorrplot}

Quality Assurance

CI Checks (GitHub Actions)

  • Unit tests (random-order)
  • Code coverage (100%)
  • Linting (0 lints)
  • Formatting (0 issues)
  • Documentation (website, no link rot, many examples)
  • CRAN checks (0 E, 0 W, 0 N)
  • Pre-commit hooks (0 issues)
  • Portability (Linux, macOS, Windows)
  • Robustness (dependencies, R versions)

Healthy and active code base

Team sport

While improving QA tools for {ggstatsplot}, in the spirit of open-source, I decided to contribute to upstream dependencies:

  • became co-author of {lintr} (linter) and {styler} (formatter)

Impact

I can haz users?!

User Love

Total downloads > 500K (97th percentile)

library(packageRank)
plot(
  cranDownloads("ggstatsplot", from = "2018-04-03", to = Sys.Date()),
  graphics = "ggplot2", smooth = TRUE
)

Total citations > 1000

Conclusion

Benefits of the {ggstatsplot} approach

{ggstatsplot} combines data visualization and statistical analysis in a single step.

It…

  • provides ready-made plots with information-rich defaults
  • minimizes the chances of making errors in statistical reporting
  • follows best practices in data visualization and statistical reporting
  • helps evaluate statistical analysis in the context of the underlying data
  • highlights the importance of the effect by providing effect size measures
  • provides an easy way to evaluate absence of an effect using a Bayesian framework
  • is extremely beginner-friendly

For more

Source code for these slides can be found on GitHub.


If you are interested in good programming and software development practices, check out my other slide decks.

Find me at…

Twitter

LinkedIn

GitHub

Website

E-mail

Thank You 😊

Session information

sessioninfo::session_info(include_base = TRUE)
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.4.2 (2024-10-31)
 os       Ubuntu 22.04.5 LTS
 system   x86_64, linux-gnu
 hostname fv-az1445-324
 ui       X11
 language (EN)
 collate  C.UTF-8
 ctype    C.UTF-8
 tz       UTC
 date     2024-11-13
 pandoc   3.5 @ /opt/hostedtoolcache/pandoc/3.5/x64/ (via rmarkdown)
 quarto   1.6.33 @ /usr/local/bin/quarto

─ Packages ───────────────────────────────────────────────────────────────────
 package          * version     date (UTC) lib source
 base             * 4.4.2       2024-10-31 [3] local
 BayesFactor        0.9.12-4.7  2024-01-24 [1] RSPM
 bayestestR         0.15.0      2024-10-17 [1] RSPM
 bitops             1.0-9       2024-10-03 [1] RSPM
 BWStest            0.2.3       2023-10-10 [1] RSPM
 cachem             1.1.0       2024-05-16 [1] RSPM
 cli                3.6.3       2024-06-21 [1] RSPM
 coda               0.19-4.1    2024-01-31 [1] RSPM
 colorspace         2.1-1       2024-07-26 [1] RSPM
 compiler           4.4.2       2024-10-31 [3] local
 correlation        0.8.6       2024-10-26 [1] RSPM
 cranlogs           2.1.1       2019-04-29 [1] RSPM
 curl               6.0.0       2024-11-05 [1] RSPM
 data.table         1.16.2      2024-10-10 [1] RSPM
 datasets         * 4.4.2       2024-10-31 [3] local
 datawizard         0.13.0      2024-10-05 [1] RSPM
 digest             0.6.37      2024-08-19 [1] RSPM
 dplyr              1.1.4       2023-11-17 [1] RSPM
 effectsize         0.8.9       2024-07-03 [1] RSPM
 evaluate           1.0.1       2024-10-10 [1] RSPM
 fansi              1.0.6       2023-12-08 [1] RSPM
 farver             2.1.2       2024-05-13 [1] RSPM
 fastmap            1.2.0       2024-05-15 [1] RSPM
 generics           0.1.3       2022-07-05 [1] RSPM
 ggplot2          * 3.5.1       2024-04-23 [1] RSPM
 ggrepel            0.9.6       2024-09-07 [1] RSPM
 ggsignif           0.6.4       2022-10-13 [1] RSPM
 ggstatsplot      * 0.12.5.9000 2024-11-12 [1] Github (IndrajeetPatil/ggstatsplot@450fa64)
 glue               1.8.0       2024-09-30 [1] RSPM
 gmp                0.7-5       2024-08-23 [1] RSPM
 graphics         * 4.4.2       2024-10-31 [3] local
 grDevices        * 4.4.2       2024-10-31 [3] local
 grid               4.4.2       2024-10-31 [3] local
 gtable             0.3.6       2024-10-25 [1] RSPM
 htmltools          0.5.8.1     2024-04-04 [1] RSPM
 httr               1.4.7       2023-08-15 [1] RSPM
 insight            0.20.5      2024-10-02 [1] RSPM
 jsonlite           1.8.9       2024-09-20 [1] RSPM
 knitr              1.49        2024-11-08 [1] RSPM
 kSamples           1.2-10      2023-10-07 [1] RSPM
 labeling           0.4.3       2023-08-29 [1] RSPM
 lattice            0.22-6      2024-03-20 [3] CRAN (R 4.4.2)
 lifecycle          1.0.4       2023-11-07 [1] RSPM
 lubridate          1.9.3       2023-09-27 [1] RSPM
 magrittr           2.0.3       2022-03-30 [1] RSPM
 MASS               7.3-61      2024-06-13 [3] CRAN (R 4.4.2)
 Matrix             1.7-1       2024-10-18 [3] CRAN (R 4.4.2)
 MatrixModels       0.5-3       2023-11-06 [1] RSPM
 memoise            2.0.1       2021-11-26 [1] RSPM
 methods          * 4.4.2       2024-10-31 [3] local
 mgcv               1.9-1       2023-12-21 [3] CRAN (R 4.4.2)
 multcompView       0.1-10      2024-03-08 [1] RSPM
 munsell            0.5.1       2024-04-01 [1] RSPM
 mvtnorm            1.3-2       2024-11-04 [1] RSPM
 nlme               3.1-166     2024-08-14 [3] CRAN (R 4.4.2)
 packageRank      * 0.9.4       2024-11-13 [1] RSPM
 paletteer          1.6.0       2024-01-21 [1] RSPM
 parallel           4.4.2       2024-10-31 [3] local
 parameters         0.23.0      2024-10-18 [1] RSPM
 patchwork          1.3.0       2024-09-16 [1] RSPM
 pbapply            1.7-2       2023-06-27 [1] RSPM
 performance        0.12.4      2024-10-18 [1] RSPM
 pillar             1.9.0       2023-03-22 [1] RSPM
 pkgconfig          2.0.3       2019-09-22 [1] RSPM
 pkgsearch          3.1.3       2023-12-10 [1] RSPM
 PMCMRplus          1.9.12      2024-09-08 [1] RSPM
 prismatic          1.1.2       2024-04-10 [1] RSPM
 purrr              1.0.2       2023-08-10 [1] RSPM
 R.methodsS3        1.8.2       2022-06-13 [1] RSPM
 R.oo               1.27.0      2024-11-01 [1] RSPM
 R.utils            2.12.3      2023-11-18 [1] RSPM
 R6                 2.5.1       2021-08-19 [1] RSPM
 Rcpp               1.0.13-1    2024-11-02 [1] RSPM
 RCurl              1.98-1.16   2024-07-11 [1] RSPM
 rematch2           2.1.2       2020-05-01 [1] RSPM
 rlang              1.1.4       2024-06-04 [1] RSPM
 rmarkdown          2.29        2024-11-04 [1] RSPM
 Rmpfr              0.9-5       2024-01-21 [1] RSPM
 scales             1.3.0       2023-11-28 [1] RSPM
 sessioninfo        1.2.2.9000  2024-11-10 [1] Github (r-lib/sessioninfo@37c81af)
 splines            4.4.2       2024-10-31 [3] local
 stats            * 4.4.2       2024-10-31 [3] local
 statsExpressions   1.6.1       2024-10-31 [1] RSPM
 stringi            1.8.4       2024-05-06 [1] RSPM
 stringr            1.5.1       2023-11-14 [1] RSPM
 sugrrants          0.2.9       2024-03-12 [1] RSPM
 SuppDists          1.1-9.8     2024-09-03 [1] RSPM
 tibble             3.2.1       2023-03-20 [1] RSPM
 tidyr              1.3.1       2024-01-24 [1] RSPM
 tidyselect         1.2.1       2024-03-11 [1] RSPM
 timechange         0.3.0       2024-01-18 [1] RSPM
 tools              4.4.2       2024-10-31 [3] local
 utf8               1.2.4       2023-10-22 [1] RSPM
 utils            * 4.4.2       2024-10-31 [3] local
 vctrs              0.6.5       2023-12-01 [1] RSPM
 withr              3.0.2       2024-10-28 [1] RSPM
 xfun               0.49        2024-10-31 [1] RSPM
 yaml               2.3.10      2024-07-26 [1] RSPM
 zeallot            0.1.0       2018-01-28 [1] RSPM

 [1] /home/runner/work/_temp/Library
 [2] /opt/R/4.4.2/lib/R/site-library
 [3] /opt/R/4.4.2/lib/R/library
 * ── Packages attached to the search path.

──────────────────────────────────────────────────────────────────────────────

Appendix

Examples of other functions

ggwithinstats()

Hypothesis about group differences: repeated measures design

ggwithinstats(
  data = WRS2::WineTasting,
  x = Wine,
  y = Taste
)

Important

✏️ Defaults

  • raw data + distributions
  • descriptive statistics
  • inferential statistics
  • effect size + uncertainty
  • pairwise comparisons
  • Bayesian hypothesis-testing
  • Bayesian estimation

Statistical approaches available

  • parametric
  • non-parametric
  • robust
  • Bayesian

gghistostats()

Distribution of a numeric variable

gghistostats(
  data = movies_long,
  x = budget,
  test.value = 30 
)

Important

✏️ Defaults

  • counts + proportion for bins
  • descriptive statistics
  • inferential statistics
  • effect size + uncertainty
  • pairwise comparisons
  • Bayesian hypothesis-testing
  • Bayesian estimation

Statistical approaches available

  • parametric
  • non-parametric
  • robust
  • Bayesian

ggdotplotstats()

Labeled numeric variable

ggdotplotstats(
  data = movies_long,
  x = budget,
  y = genre,
  test.value = 30 
)

Important

✏️ Defaults

  • descriptive statistics
  • inferential statistics
  • effect size + uncertainty
  • pairwise comparisons
  • Bayesian hypothesis-testing
  • Bayesian estimation

Statistical approaches available

  • parametric
  • non-parametric
  • robust
  • Bayesian

ggscatterstats()

Hypothesis about correlation: Two numeric variables

ggscatterstats(
  data = movies_long,
  x = budget,
  y = rating
)

Important

✏️ Defaults

  • joint distribution
  • marginal distribution
  • effect size + uncertainty
  • pairwise comparisons
  • Bayesian hypothesis-testing
  • Bayesian estimation

Statistical approaches available

  • parametric
  • non-parametric
  • robust
  • Bayesian

ggcorrmat()

Hypothesis about correlation: Multiple numeric variables

ggcorrmat(dplyr::starwars)

Important

✏️ Defaults

  • inferential statistics
  • effect size + uncertainty
  • careful handling of NAs
  • partial correlations

Statistical approaches available

  • parametric
  • non-parametric
  • robust
  • Bayesian

ggpiestats()

Hypothesis about composition of categorical variables

ggpiestats(
  data = mtcars,
  x = am,
  y = cyl
)

Important

✏️ Defaults

  • descriptive statistics
  • inferential statistics
  • effect size + uncertainty
  • goodness-of-fit tests
  • Bayesian hypothesis-testing
  • Bayesian estimation

ggbarstats()

Hypothesis about composition of categorical variables

ggbarstats(
  data = mtcars,
  x = am,
  y = cyl
)

Important

✏️ Defaults

  • descriptive statistics
  • inferential statistics
  • effect size + uncertainty
  • goodness-of-fit tests
  • Bayesian hypothesis-testing
  • Bayesian estimation

ggcoefstats()

Hypothesis about regression coefficients

mod <- lm(
  formula = rating ~ mpaa,
  data = movies_long
)

ggcoefstats(mod)

Important

✏️ Defaults

  • estimate + uncertainty
  • inferential statistics (\(t\), \(z\), \(F\), \(\chi^2\))
  • model fit indices (AIC + BIC)

Supports all regression models supported in the {easystats} ecosystem.

Meta-analysis is also supported!

grouped_ variants

Iterating over a grouping variable

grouped_ functions

grouped_ggpiestats(
  data = mtcars,
  x = cyl,
  grouping.var = am 
)

Available grouped_ variants:

  • grouped_ggbetweenstats()
  • grouped_ggwithinstats()
  • grouped_gghistostats()
  • grouped_ggdotplotstats()
  • grouped_ggscatterstats()
  • grouped_ggcorrmat()
  • grouped_ggpiestats()
  • grouped_ggbarstats()
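To make the idea of iterating over a grouping variable concrete, the sketch below does the same thing by hand: it splits the data, draws one plot per subset, and combines the results. It assumes {ggstatsplot} is installed and is meant as a conceptual equivalent of `grouped_ggpiestats()`, not as a description of how the package implements it. `data_by_group` and `plots` are names chosen here for illustration.

```r
library(ggstatsplot)

# Split the data by the grouping variable (transmission type, `am`)
data_by_group <- split(mtcars, mtcars$am)

# Create one pie chart per subset, titled with the level of `am`
plots <- lapply(names(data_by_group), function(level) {
  ggpiestats(
    data  = data_by_group[[level]],
    x     = cyl,
    title = paste("am:", level)
  )
})

# Arrange the individual plots in a single figure
combine_plots(plotlist = plots, plotgrid.args = list(nrow = 1))
```

The grouped_ variants wrap this split, plot, and combine pattern in a single call.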

Customizability

“What if I don’t like the default plots?” 🤔

Modify the look 🎨

By changing theme and palette

ggbetweenstats(
  data = movies_long,
  x = mpaa,
  y = rating,
  ggtheme = ggthemes::theme_economist(), 
  palette = "Darjeeling2", 
  package = "wesanderson" 
)

By using {ggplot2} functions

ggbetweenstats(
  data = mtcars,
  x = am,
  y = wt,
  type = "bayes"
) +
  scale_y_continuous(sec.axis = dup_axis()) 

Too much information 🙈

Get only plots:

ggbetweenstats(
  data = iris,
  x = Species,
  y = Sepal.Length,
  # turn off statistical analysis
  centrality.plotting = FALSE, 
  results.subtitle = FALSE, 
  bf.message = FALSE, 
  # turn off pairwise comparisons
  pairwise.display = "none" 
)

Get only expressions:

stats_expr <- ggpiestats(
  data = Titanic_full,
  x = Survived,
  y = Sex
) %>%
  extract_subtitle()

ggiraphExtra::ggSpine( 
  data = Titanic_full,
  aes(x = Sex, fill = Survived)
) +
  labs(subtitle = stats_expr)  

{ggstatsplot}: Details about statistical reporting

Supports different statistical approaches

Note

Functions Description Parametric Non-parametric Robust Bayesian
ggbetweenstats() Between group comparisons
ggwithinstats() Within group comparisons
gghistostats(), ggdotplotstats() Distribution of a numeric variable
ggcorrmat() Correlation matrix
ggscatterstats() Correlation between two variables
ggpiestats(), ggbarstats() Association between categorical variables NA NA
ggpiestats(), ggbarstats() Equal proportions for categorical variable levels NA NA
ggcoefstats() Regression modeling
ggcoefstats() Random-effects meta-analysis NA

Toggling statistical approaches 🔀

Parametric

# anova
ggbetweenstats(
  data = mtcars,
  x = cyl,
  y = wt,
  type = "p" 
)

# correlation analysis
ggscatterstats(
  data = mtcars,
  x = wt,
  y = mpg,
  type = "p" 
)

# t-test
gghistostats(
  data = mtcars,
  x = wt,
  test.value = 2,
  type = "p" 
)

Non-parametric

# anova
ggbetweenstats(
  data = mtcars,
  x = cyl,
  y = wt,
  type = "np" 
)

# correlation analysis
ggscatterstats(
  data = mtcars,
  x = wt,
  y = mpg,
  type = "np" 
)

# t-test
gghistostats(
  data = mtcars,
  x = wt,
  test.value = 2,
  type = "np" 
)
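The remaining two approaches are selected the same way. The sketch below assumes {ggstatsplot} is installed and reuses the `mtcars` examples from above with `type = "r"` (robust) and `type = "bayes"` (Bayesian). The object names `p_robust` and `p_bayes` are chosen here for illustration.

```r
library(ggstatsplot)

# robust between-group comparison
p_robust <- ggbetweenstats(
  data = mtcars,
  x = cyl,
  y = wt,
  type = "r"
)

# Bayesian correlation analysis
p_bayes <- ggscatterstats(
  data = mtcars,
  x = wt,
  y = mpg,
  type = "bayes"
)

# print the plots
p_robust
p_bayes
```

Only the `type` argument changes between approaches; the data and variable arguments stay the same.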

Alternative: Pure Pain

Hunting for packages

📦 for inferential statistics ({stats})
📦 computing effect size + CIs ({effectsize})
📦 for descriptive statistics ({skimr})
📦 pairwise comparisons ({multcomp})
📦 Bayesian hypothesis testing ({BayesFactor})
📦 Bayesian estimation ({bayestestR})
📦 …

Inconsistent APIs

🤔 accepts data frame, vector, matrix?
🤔 long/wide format data?
🤔 works with NAs?
🤔 returns data frame, vector, matrix?
🤔 works with tibbles?
🤔 has all necessary details?
🤔 …
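A small base-R sketch of this inconsistency, using `mtcars` as stand-in data: three routine analyses take different kinds of input and return three different kinds of object, so even extracting a p-value needs function-specific code.

```r
# Three common analyses, three different kinds of return value
t_res   <- t.test(wt ~ am, data = mtcars)        # formula in, "htest" list out
aov_res <- aov(wt ~ factor(cyl), data = mtcars)  # formula in, model object out
cor_res <- cor(mtcars[, c("wt", "mpg", "hp")])   # data frame in, matrix out

print(class(t_res))
print(class(aov_res))
print(class(cor_res))

# Extracting a p-value looks different in each case
p_t   <- t_res$p.value
p_aov <- summary(aov_res)[[1]][["Pr(>F)"]][1]
print(c(t_test = p_t, anova = p_aov))

# cor() returns no p-values at all; a separate call to cor.test() is needed
p_cor <- cor.test(mtcars$wt, mtcars$mpg)$p.value
print(p_cor)
```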

Footnotes

  1. Open Science Collaboration, Science, 2015

  2. Nuijten et al., Behavior Research Methods, 2016

  3. Aczel et al., AMPPS, 2018